Abstract. We consider the problem of reinforcement learning using
function approximation, where the approximating basis can change
dynamically while interacting with the environment. A motivation for such
an approach is maximizing the value function fitness to the problem
faced. Three errors are considered: approximation square error, Bellman
residual, and projected Bellman residual. Algorithms under the
actor-critic framework are presented, and shown to converge. The advantage
of such an adaptive basis is demonstrated in simulations.



1 Introduction 



Reinforcement Learning (RL) [1] is an approach for solving Markov Decision
Processes (MDPs) when interacting with an unknown environment. One of the
main obstacles in applying RL methods is how to cope with a large state space.
In general, the underlying methods are based on dynamic programming, and
include adaptive schemes that mimic either value iteration, such as Q-learning, or
policy iteration, such as Actor-Critic (AC) methods. While the former attempt to
directly learn the optimal value function, the latter are based on quickly learning
the value of the currently used policy, followed by a slower policy improvement
step. In this paper we focus on AC methods.

There are two major problems when solving MDPs with a large state space.
The first is the storage problem, i.e., it is impractical to store the value function
and the optimal action explicitly for each state. The second is generalization:
some notion of similarity between states is needed since most states are not
visited, or are visited only a few times. These issues are addressed by the
Function Approximation (FA) approach [3], which involves approximating the
value function by functional approximators with a smaller number of parameters
in comparison to the original number of states. The success of this approach
rests mainly on selecting appropriate features, and on a proper choice of the
approximation architecture. In a linear approximation architecture, the value
of a state is determined by a linear combination of the entries of a low
dimensional feature vector. In the RL context, linear architectures enjoy
convergence results and performance guarantees (e.g., [1]).

The approximation quality depends on the choice of the basis functions.
In this paper we consider the possibility of tuning the basis functions on-line,
under the AC framework. In this framework, an agent interacting with the
environment is composed of two sub-systems. The first is a critic, which estimates
the value function for the states encountered. This sub-system acts on a fast
time scale. The second is an actor, which, based on the critic output, and mainly
on the temporal-difference (TD) signal, improves the agent's policy using gradient
methods. The actor operates on a second time scale, slower than the time scale of
the critic. Bhatnagar et al. [5] proved that such an algorithm, with an appropriate
relation between the time scales, converges.

We suggest adding a third time scale that is slower than both the critic
and the actor, minimizing some error criterion while adapting the critic's basis
functions to better fit the problem. Convergence of the value function, the policy,
and the basis is guaranteed in such an architecture, and simulations show that
a dramatic improvement can be achieved using basis adaptation.

At first sight, using multiple time scales may appear to slow down convergence.
Two approaches may be applied in order to overcome this problem. First, a recent
work of Mokkadem and Pelletier [H], based on previous research by Polyak [T5]
and others, has demonstrated that combining the algorithm iterates with the
averaging method of [13] leads to a convergence rate in distribution that is the
same as the optimal rate. Second, in multiple time scales the ratio between the
step sizes of the slower and faster time scales should converge to 0. Thus, time
scales which are close, operate on the fast time scale, and satisfy the condition
above, are easy to find for any practical needs.

Several works have been done in the area of adaptive bases. These works do
not address the problem of policy improvement with adaptive bases. We mention
here two notable works which are similar in spirit to our work. The first
is that of Menache et al. [TT]. The authors suggested two algorithms for adaptive
bases: one algorithm is based on gradient methods for least-squares TD
(LSTD) of Bradtke and Barto [5], and the other algorithm is based on the cross
entropy method. Both algorithms were demonstrated in simulations to achieve
better performance than their fixed basis counterparts, but no convergence
guarantees were supplied. Yu and Bertsekas [12] suggested several algorithms for
two main problem classes: policy evaluation and optimal stopping. The former
is closer to our work than the latter, so we focus on this class. Three target
functions were considered in that work: mean TD error, Bellman error, and
projected Bellman error. The main difference between [19] and our work (besides
the policy improvement) is the following. The algorithmic variants suggested in
[T5] are in the flavor of the LSTD and LSPE algorithms [3], while in our work the
algorithms are TD based; thus, in our work no matrix inversion is involved. Also,
we demonstrate the effectiveness of the algorithms in the current work.

The paper is organized as follows. In Section 2 we define some preliminaries
and outline the framework. In Section 3 we introduce the algorithms suggested
for adaptive bases. In Section 4 we show the convergence of the algorithms
suggested, while in Section 5 we demonstrate the algorithms in simulations. In
Section 6 we discuss the results.



2 Preliminaries 

In this section, we introduce the framework, review actor-critic algorithms, overview 
multiple time scales stochastic approximation (MTS-SA), and state a related 
theorem which will be used later in proving the main results. 

2.1 The Framework 

We consider an agent interacting with an unknown environment that is modeled
by a Markov Decision Process (MDP) [14] in discrete time with a finite state
set $\mathcal{X}$ and an action set $\mathcal{U}$, where $N = |\mathcal{X}|$. Each selected action
$u \in \mathcal{U}$ of the agent determines a stochastic transition matrix
$P_u = [P_u(y|x)]_{x,y \in \mathcal{X}}$, where $y$ is the state that follows the state $x$.

For each state $x \in \mathcal{X}$ the agent receives a corresponding reward $g(x)$ that
depends only on the current state¹. The agent maintains a parameterized policy
function, which is a probabilistic function denoted by $\mu_\theta(u|x)$, mapping an
observation $x \in \mathcal{X}$ into a probability distribution over the controls $\mathcal{U}$.
The parameter $\theta \in \mathbb{R}^{K_\theta}$ is tunable, and $\mu_\theta(u|x)$ is a differentiable
function w.r.t. $\theta$. We note that for different $\theta$'s, different probability
distributions over $\mathcal{U}$ may be associated with each $x \in \mathcal{X}$. We denote by
$x_0, u_0, g_0, x_1, u_1, g_1, \ldots$ a state-action-reward trajectory where the subindex
specifies time.

Under each policy induced by $\mu_\theta(u|x)$, the environment and the agent
together induce a Markovian transition function, denoted by $P_\theta(y|x)$, satisfying
$P_\theta(y|x) = \sum_{u} \mu_\theta(u|x) P_u(y|x)$. The Markovian transition function
$P_\theta(y|x)$ induces a stationary distribution over the state space $\mathcal{X}$, denoted
by $D(\theta)$. This distribution induces a natural norm, denoted by $\|\cdot\|_{D(\theta)}$,
which is a weighted norm defined by $\|x\|^2_{D(\theta)} = x^\top D(\theta) x$. Note that
when the parameter $\theta$ changes, the norm changes as well. We denote by
$\mathrm{E}_\theta[\cdot]$ the expectation operator w.r.t. the measures $P_\theta(y|x)$ and
$D(\theta)$. There are several performance criteria investigated in the RL literature
that differ mainly in their time horizon and the treatment of future rewards [J.
In this work we focus on the average reward criterion, defined by

    $\eta_\theta = \mathrm{E}_\theta[g(x)].$    (1)

The agent's goal is to find the parameter $\theta$ that maximizes $\eta_\theta$. Similarly,
define the (differential) value function as

    $J(x) \triangleq \mathrm{E}_\theta\left[ \sum_{n=0}^{\tau-1} \left( g(x_n) - \eta_\theta \right) \,\middle|\, x_0 = x \right],$    (2)

where $\tau = \min\{k > 0 \mid x_k = x^*\}$ and $x^*$ is some state that is recurrent for
all policies, which we assume to exist. Define the Bellman operator as
$TJ(x) = g(x) - \eta + \mathrm{E}_\theta[J(y)|x]$. Thus, based on (2), it is easy to show the
following connection between the average reward and the value function under a
given policy [3], i.e.,

    $J(x) = g(x) - \eta + \mathrm{E}_\theta[J(y)|x] = TJ(x).$    (3)



¹ Generalizing the results presented here to state-action rewards is straightforward.



For later use, we denote by $J$ and $TJ$ the column representations of $J(x)$ and
$TJ(x)$, respectively.

We define the Temporal Difference (TD) [4,16] of the state $x$ followed by
the state $y$ as $d(x, y) = g(x) - \eta + J(y) - J(x)$, where for a specific time $n$ we
abbreviate $d(x_n, x_{n+1})$ as $d_n$. Based on (3) we can see that

    $\mathrm{E}_\theta[d(x,y)|x] = 0$, and $\mathrm{E}_\theta[d(x,y)] = 0.$    (4)

Based on this property, a wide family of algorithms known as TD algorithms
exists [1], where common to all these algorithms is solving (4) iteratively.

Notational comment: from now on, we omit the dependency on 9 whenever 
it is clear from the context. 
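
As an illustration of (3) and (4), the following minimal sketch computes the average reward and a differential value function for a small chain under a fixed policy, and checks that the conditional expectation of the TD signal vanishes in every state. The 3-state chain, its rewards, and the helper names are made-up assumptions for illustration only; the additive constant of $J$ is fixed by pinning $J$ to zero at state 0, which plays the role of $x^*$ here.

```python
def stationary(P, iters=2000):
    """Stationary distribution of a row-stochastic matrix via power iteration."""
    n = len(P)
    dist = [1.0 / n] * n
    for _ in range(iters):
        dist = [sum(dist[x] * P[x][y] for x in range(n)) for y in range(n)]
    return dist


def differential_value(P, g, iters=2000):
    """Average reward eta and a solution J of J = g - eta + P J (with J[0] = 0)."""
    n = len(P)
    dist = stationary(P, iters)
    eta = sum(dist[x] * g[x] for x in range(n))
    J = [0.0] * n
    for _ in range(iters):
        TJ = [g[x] - eta + sum(P[x][y] * J[y] for y in range(n)) for x in range(n)]
        J = [v - TJ[0] for v in TJ]  # pin J[0] = 0 to fix the additive constant
    return eta, J


def expected_td(P, g, eta, J):
    """E[d(x, y) | x] for every state x, with d = g(x) - eta + J(y) - J(x)."""
    n = len(P)
    return [sum(P[x][y] * (g[x] - eta + J[y] - J[x]) for y in range(n))
            for x in range(n)]


# Hypothetical 3-state chain under a fixed policy (all numbers are made up).
P = [[0.5, 0.5, 0.0],
     [0.1, 0.6, 0.3],
     [0.2, 0.0, 0.8]]
g = [1.0, 0.0, 2.0]
eta, J = differential_value(P, g)
print(expected_td(P, g, eta, J))  # every entry should be numerically close to 0
```

TD algorithms replace the exact expectation above with sampled transitions, which is what makes the zero-mean property usable on-line.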



2.2 Actor-Critic Algorithms 

A well known class of RL approaches is the so-called actor-critic (AC)
algorithms, where the agent is divided into two components, an actor and a critic.
The critic functions as a state value estimator using the so-called TD-learning
algorithm, whereas the actor attempts to select actions based on the TD signal
estimated by the critic. These two components solve their own optimization
problems separately, while interacting with each other.

The critic typically uses a function approximator which approximates the
value function in a subspace of a reduced dimension $K_r$. Define the basis matrix

    $\Phi = [\phi_k(x_n)]_{1 \le n \le N,\, 1 \le k \le K_r} \in \mathbb{R}^{N \times K_r},$    (5)

where its columns span the subspace. Thus, the approximation to the value
function is $\hat{J}(x, r) = \phi(x)^\top r$, where $r$ is the solution of the following
quadratic program: $r = \arg\min_{r' \in \mathbb{R}^{K_r}} \|\Phi r' - J\|^2_{D(\theta)}$. This solution
yields the linear projection operator,

    $\Pi = \Phi \left( \Phi^\top D(\theta) \Phi \right)^{-1} \Phi^\top D(\theta),$    (6)

that satisfies

    $\hat{J}(r) = \Pi J,$    (7)

where $\hat{J}(r)$ is the vector representation of $\hat{J}(x, r)$. Abusing notation, we
define the (state dependent) projection operator on $J(x)$ as $\hat{J}(x) = \Pi J(x)$.
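
The projection in (6) and (7) can be sketched as a weighted least-squares fit. The snippet below is a minimal illustration in plain Python, not the authors' implementation; the basis matrix, the weights, and the target vector are made-up numbers, and `solve` and `project` are hypothetical helper names. It solves the normal equations $(\Phi^\top D \Phi) r = \Phi^\top D J$ and returns both $r$ and $\Pi J = \Phi r$.

```python
def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda i: abs(M[i][c]))
        M[c], M[p] = M[p], M[c]
        piv = M[c][c]
        M[c] = [v / piv for v in M[c]]
        for i in range(n):
            if i != c:
                f = M[i][c]
                M[i] = [vi - f * vc for vi, vc in zip(M[i], M[c])]
    return [M[i][n] for i in range(n)]


def project(Phi, dist, J):
    """Return (r, Pi J), where r minimizes the dist-weighted squared error."""
    N, K = len(Phi), len(Phi[0])
    gram = [[sum(dist[x] * Phi[x][i] * Phi[x][j] for x in range(N))
             for j in range(K)] for i in range(K)]
    rhs = [sum(dist[x] * Phi[x][i] * J[x] for x in range(N)) for i in range(K)]
    r = solve(gram, rhs)
    return r, [sum(Phi[x][k] * r[k] for k in range(K)) for x in range(N)]


# Made-up example: N = 4 states, K_r = 2 basis functions (constant and linear).
Phi = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
dist = [0.1, 0.2, 0.3, 0.4]   # stationary weights (diagonal of D)
J = [0.0, 1.0, 4.0, 9.0]      # a target that is not in the span of Phi
r, PJ = project(Phi, dist, J)
print(r, PJ)
```

Two properties characterize the result: the residual $J - \Pi J$ is orthogonal to every basis column in the weighted inner product, and projecting $\Pi J$ again leaves it unchanged.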

As mentioned above, the actor receives the TD signal from the critic, and
based on this signal, the actor tries to select the optimal action. As described
in Section 2.1, the actor maintains a policy function $\mu_\theta(u|x)$. In the following,
we state a theorem that serves as the foundation for the policy gradient
algorithm described later. The theorem relates the gradient w.r.t. $\theta$ of the average
reward, $\nabla_\theta \eta_\theta$, to the TD signal, $d(x,y)$. Define the likelihood ratio derivative as
$\psi_\theta(x,u) = \nabla_\theta \mu_\theta(u|x) / \mu_\theta(u|x)$. We omit the dependency of $\psi$ on $x$, $u$,
and $\theta$ throughout the paper. The following assumption states that $\psi$ is bounded.

Assumption 1. For all $x \in \mathcal{X}$, $u \in \mathcal{U}$, and $\theta \in \mathbb{R}^{K_\theta}$, there exists a
positive constant, $B_\psi$, such that $\|\psi\|_2 \le B_\psi < \infty$.



Based on this, we present the following lemma that relates the gradient of $\eta$ to
the TD signal [S].

Lemma 2. The gradient of the average reward (w.r.t. $\theta$) can be expressed by
$\nabla_\theta \eta = \mathrm{E}[\psi_\theta(x,u)\, d(x,y)]$.
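
Lemma 2 can be checked numerically on a small problem. The sketch below is an illustration under stated assumptions, not part of the original analysis: it uses a made-up MDP with 3 states and 2 actions and a tabular softmax policy (one parameter per state-action pair), computes $\mathrm{E}[\psi d]$ exactly by summing over states, actions, and next states, and compares it with a central finite-difference estimate of the gradient of the average reward. All function names are hypothetical.

```python
import math


def softmax(row):
    w = [math.exp(t) for t in row]
    z = sum(w)
    return [v / z for v in w]


def solve_chain(P, g, theta, iters=2000):
    """Return (mu, dist, eta, J) for the chain induced by a tabular softmax policy."""
    U, N = len(P), len(g)
    mu = [softmax(theta[x]) for x in range(N)]
    Pt = [[sum(mu[x][u] * P[u][x][y] for u in range(U)) for y in range(N)]
          for x in range(N)]
    dist = [1.0 / N] * N
    for _ in range(iters):  # power iteration for the stationary distribution
        dist = [sum(dist[x] * Pt[x][y] for x in range(N)) for y in range(N)]
    eta = sum(dist[x] * g[x] for x in range(N))
    J = [0.0] * N
    for _ in range(iters):  # relative value iteration for J = g - eta + P J
        TJ = [g[x] - eta + sum(Pt[x][y] * J[y] for y in range(N)) for x in range(N)]
        J = [v - TJ[0] for v in TJ]
    return mu, dist, eta, J


def gradient_via_td(P, g, theta):
    """E[psi(x, u) d(x, y)], computed exactly by summing over x, u, y."""
    U, N = len(P), len(g)
    mu, dist, eta, J = solve_chain(P, g, theta)
    grad = [[0.0] * U for _ in range(N)]
    for x in range(N):
        for u in range(U):
            d_bar = sum(P[u][x][y] * (g[x] - eta + J[y] - J[x]) for y in range(N))
            for b in range(U):
                # d log mu(u|x) / d theta[x][b] for a softmax policy
                psi = (1.0 if b == u else 0.0) - mu[x][b]
                grad[x][b] += dist[x] * mu[x][u] * psi * d_bar
    return grad


def gradient_via_finite_differences(P, g, theta, eps=1e-5):
    grad = [[0.0] * len(row) for row in theta]
    for x in range(len(theta)):
        for b in range(len(theta[x])):
            hi = [row[:] for row in theta]
            lo = [row[:] for row in theta]
            hi[x][b] += eps
            lo[x][b] -= eps
            grad[x][b] = (solve_chain(P, g, hi)[2] - solve_chain(P, g, lo)[2]) / (2 * eps)
    return grad


# Made-up MDP: P[u][x][y] is the probability of moving from x to y under action u.
P = [[[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]],
     [[0.2, 0.3, 0.5], [0.5, 0.25, 0.25], [0.1, 0.1, 0.8]]]
g = [0.0, 1.0, 3.0]
theta = [[0.2, -0.1], [0.0, 0.5], [-0.3, 0.3]]
print(gradient_via_td(P, g, theta))
print(gradient_via_finite_differences(P, g, theta))
```

The two printed gradients should agree up to finite-difference error, which is what allows the actor to use the sampled product $\psi_n d_n$ as a stochastic gradient direction.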

2.3 Multiple Time Scales Stochastic Approximation 

Stochastic approximation (SA), and in particular the ODE approach [9], is a
widely used method for investigating the asymptotic behavior of stochastic
iterates. For example, consider the following stochastic iterate

    $\varphi_{n+1} = \varphi_n + \alpha_n G(\varphi_n, \zeta_{n+1}),$

where $\{\zeta_{n+1}\}$ is some random process and $\{\alpha_n\}$ are step sizes that form a
positive series satisfying conditions to be defined later. The key idea of the
technique is the following. Suppose that the iterate can be decomposed into a mean
function, denoted by $F(\cdot)$, and a noise term (martingale difference noise),
denoted by $M_{n+1}$,

    $\varphi_{n+1} = \varphi_n + \alpha_n G(\varphi_n, \zeta_{n+1}) = \varphi_n + \alpha_n \left( F(\varphi_n) + M_{n+1} \right),$    (8)

and suppose that the effect of the noise weakens due to repeated averaging.
Consider the following ODE, which is a continuous version of $\varphi$ and $F(\cdot)$,

    $\dot{\varphi}_t = F(\varphi_t),$    (9)

where the dot above a variable stands for a time derivative. Then, a typical
result of the ODE method in the SA theory suggests that the asymptotic limits
of (8) and (9) are identical.

The classical theory of SA considers an iterate, which may be in some finite
dimensional Euclidean space. Sometimes, we need to deal with several
multidimensional iterates, dependent on one another, where each iterate operates
on a different time scale. Surprisingly, this type of SA, called multiple time scale
SA (MTS-SA), is sometimes easier to analyze than the same iterates operating
on a single time scale. The first analysis of two time-scales SA algorithms
was given by Borkar in [6] and later expanded to MTS by Leslie and Collins
in [To]. In the following we describe the problem of MTS-SA, state the related
ODEs, and finally state the conditions under which MTS-SA iterates converge.
We follow the definitions of [To].

Consider $L$ dependent SA iterates as follows:

    $\varphi^{(i)}_{n+1} = \varphi^{(i)}_n + \alpha^{(i)}_n \left( F^{(i)}\left( \varphi^{(1)}_n, \ldots, \varphi^{(L)}_n \right) + M^{(i)}_{n+1} \right), \quad 1 \le i \le L,$    (10)

where $\varphi^{(i)} \in \mathbb{R}^{d_i}$ and $F^{(i)} : \mathbb{R}^{\sum_{j=1}^{L} d_j} \to \mathbb{R}^{d_i}$. The following assumption
contains a standard requirement for the MTS-SA step sizes.

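Assumption 3. (MTS-SA step size assumptions)

1. For $1 \le i \le L$, we have $\sum_{n=0}^{\infty} \alpha^{(i)}_n = \infty$ and $\sum_{n=0}^{\infty} \left( \alpha^{(i)}_n \right)^2 < \infty$.

2. For $1 \le i \le L - 1$, we have $\lim_{n \to \infty} \alpha^{(i)}_n / \alpha^{(i+1)}_n = 0$.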
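
One simple family of schedules that satisfies both requirements is polynomial decay with a different exponent per time scale. The sketch below is an illustrative assumption (the exponents are not taken from the paper): with $\alpha^{(i)}_n = (n+1)^{-p_i}$ and $1 \ge p_1 > p_2 > p_3 > 1/2$, every series diverges, every squared series converges, and the ratio of a slower step size to the next faster one decays to zero.

```python
def step_sizes(n, exponents=(1.0, 0.8, 0.6)):
    """Step sizes (alpha1, alpha2, alpha3) at iteration n.

    Index 1 is the slowest time scale (largest exponent, smallest steps) and
    the last index is the fastest. Exponents in (1/2, 1] keep the sum of the
    steps divergent and the sum of their squares finite.
    """
    return tuple((n + 1) ** (-p) for p in exponents)


for n in (10, 1000, 100000):
    a1, a2, a3 = step_sizes(n)
    # Both ratios shrink as n grows, as required by the second condition.
    print(n, a1 / a2, a2 / a3)
```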

We interpret the second requirement in the following way: the higher the index
$i$ of an iterate, the faster the time scale it operates on. This is because there
exists some $n_0$ such that for all $n > n_0$ the step size of the $i$-th iterate is
uniformly larger than the step sizes of the iterates $1 \le j \le i - 1$. Thus, the $i$-th
iterate advances more than any of the iterates $1 \le j \le i - 1$, or in other words, it
operates on a faster time scale. The following assumption aggregates the main
requirements for the MTS-SA iterates.

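Assumption 4. (MTS-SA iterate assumptions)

1. $F^{(i)}(\cdot)$ are globally Lipschitz continuous.

2. For $1 \le i \le L$, we have $\sup_n \|\varphi^{(i)}_n\| < \infty$.

3. For $1 \le i \le L$, $\sum_{k=0}^{n} \alpha^{(i)}_k M^{(i)}_{k+1}$ converges a.s.

4. (The ODEs requirements) Remark: this requirement is defined recursively,
where requirement (a) below is the initial requirement related to the $L$-th
ODE, and requirement (b) below describes the $i$-th ODE system that is
recursively based on the $(i+1)$-th ODE system, going from $i = L - 1$ to $i = 1$.
Denote $\varphi^{(i \to j)} = (\varphi^{(i)}, \ldots, \varphi^{(j)})$.

(a) Define the $L$-th ODE system to be

    $\dot{\varphi}^{(1 \to L-1)}_t = 0, \quad \dot{\varphi}^{(L)}_t = F^{(L)}\left( \varphi^{(1 \to L-1)}_t, \varphi^{(L)}_t \right),$    (11)

and suppose the initial condition $\varphi^{(1 \to L-1)}_{t=0} = \varphi_0$. Then, there exists
a Lipschitz continuous function $\xi^{(L \to L)}(\varphi_0)$ such that the ODE system (11)
converges to the point $(\varphi_0, \xi^{(L \to L)}(\varphi_0))$.

(b) Define the $i$-th ODE system, $i = L - 1, \ldots, 1$, to be

    $\dot{\varphi}^{(1 \to i-1)}_t = 0, \quad \dot{\varphi}^{(i)}_t = F^{(i)}\left( \varphi^{(1 \to i-1)}_t, \varphi^{(i)}_t, \xi^{(i+1 \to L)}\left( \varphi^{(1 \to i)}_t \right) \right),$    (12)

where $\xi^{(i+1 \to L)}(\cdot)$ is determined by the $(i+1)$-th ODE system, and suppose
the initial condition $\varphi^{(1 \to i-1)}_{t=0} = \varphi_0$. Then, there exists a Lipschitz
continuous function $\xi^{(i \to L)}(\varphi_0)$ such that the ODE system (12) converges
to the point $(\varphi_0, \xi^{(i \to L)}(\varphi_0))$.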

The first two requirements are common conditions for SA iterates to converge.
The third requirement ensures that the noise term asymptotically vanishes. The
fourth requirement ensures (using a recursive definition) that for each time scale
$i$, where the slower time scales $1, \ldots, i - 1$ are static and where for the faster
time scales $i + 1, \ldots, L$ there exists a function $\xi^{(i+1 \to L)}(\cdot)$ (which is the solution
of the $(i+1)$-th ODE system), there exists a Lipschitz convergent function. Based
on these requirements, we cite the following theorem due to Leslie and Collins
[To].

Theorem 5. Consider the iterates (10) and suppose Assumptions 3 and 4 hold.
Then, asymptotically the iterates (10) converge to the invariant set
of the dynamic system

    $\dot{\varphi}^{(1)}_t = F^{(1)}\left( \varphi^{(1)}_t, \xi^{(2 \to L)}\left( \varphi^{(1)}_t \right) \right),$    (13)

where $\xi^{(2 \to L)}(\cdot)$ is determined by requirement 4 of Assumption 4.

3 Main Results 

In this section we present the main theoretical results of the work. We start 
by introducing adaptive bases and show the algorithms that are derived from 
choosing different approximating schemes. 

3.1 Adaptive Bases 

The motivation for adaptive bases is the following. Consider an agent that 
chooses a basis for the critic in order to approximate the value function. The 
basis which one chooses with no prior knowledge might not be suitable for the 
problem at hand. A poor subspace where the actual value function is poorly 
supported may be chosen. Thus, one might prefer to choose a parameterized 
basis that has additional flexibility by changing a small set of parameters. 

We propose to consider a basis that is linear in some of the parameters but
has several other parameters that allow greater flexibility. In other words, we
consider bases that are linear with respect to some of the terms (related to the
fast time scale), and nonlinear with respect to the rest (related to the slow time
scale). The idea is that one is unlikely to lose much from such an approach when
it fails, while in many cases it is possible to obtain a better fit, and thus better
performance, due to this additional flexibility. Mathematically,

    $\hat{J}(x, r, s) = \phi(x, s)^\top r, \quad s \in \mathbb{R}^{K_s},$    (14)

where $r$ is a linear parameter related to the fast time scale, and $s$ is the non-linear
parameter related to the slow time scale. In view of (5), we note that from
now on the matrix $\Phi$ depends on $s$, i.e., $\Phi = \Phi_s$, and in matrix form we have
$\hat{J} = \Phi_s r$, but for ease of exposition we drop the dependency on $s$. The following
assumption is needed for proving later results.

Assumption 6. The columns of the matrix $\Phi$ are linearly independent,
$K_r < N$, and $\Phi r \neq e$, where $e$ is a vector of 1's. Moreover, the functions $\phi(x, s)$
and $\partial \phi(x, s) / \partial s_i$ for $1 \le i \le K_s$ are Lipschitz in $s$ with a coefficient $L_\phi$, and
bounded with coefficient $B_\phi$.
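
To make the parameterization (14) concrete, here is a minimal sketch of a basis that is linear in $r$ and nonlinear in a scalar $s$. The cosine form and the fixed phases are assumptions chosen for illustration; they are bounded and Lipschitz in $s$, in the spirit of the smoothness part of Assumption 6, and the helper names are hypothetical. The derivative of the approximate value with respect to $s$ is $(\partial \phi / \partial s)^\top r$, which is the quantity the slow time scale uses.

```python
import math


def phi(x, s, phases):
    """Assumed basis: phi_k(x, s) = cos(s + phases[x][k])."""
    return [math.cos(s + p) for p in phases[x]]


def dphi_ds(x, s, phases):
    """Derivative of each basis function with respect to the scalar s."""
    return [-math.sin(s + p) for p in phases[x]]


def value(x, r, s, phases):
    """Approximate value: linear in r, nonlinear in s."""
    return sum(f * w for f, w in zip(phi(x, s, phases), r))


def dvalue_ds(x, r, s, phases):
    """d/ds of the approximate value, i.e. (dphi/ds)^T r."""
    return sum(f * w for f, w in zip(dphi_ds(x, s, phases), r))


# Made-up phases for N = 2 states and K_r = 3 basis functions.
phases = [[0.0, 1.0, 2.0], [0.5, 1.5, 2.5]]
r = [1.0, -0.5, 0.25]
print(value(0, r, 0.3, phases), dvalue_ds(0, r, 0.3, phases))
```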



Notation comment: for ease of exposition, we drop the dependency on $x_n$, e.g.,
$\phi_n = \phi(x_n, s_n)$, $g_n = g(x_n)$. Denote $\phi = \phi(x, s)$, $\phi' = \phi(y, s)$ (where, as in
Section 2.1, $y$ is the state that follows the state $x$), $\phi'_n = \phi(x_{n+1}, s_n)$,
$d_n = d(x_n, x_{n+1})$, and $d = d(x, y)$. Thus, $d = g - \eta + \phi'^\top r - \phi^\top r$ and
$d_n = g_n - \eta_n + \phi'^\top_n r_n - \phi^\top_n r_n$.


3.2 Minimum Square Error and TD 

Assume a basis parameterized as in (14). The minimum square error (MSE) is
defined as

    $\mathrm{MSE} = \frac{1}{2} \mathrm{E}\left[ \left( J(x) - \hat{J}(x) \right)^2 \right].$

The gradient with respect to $r$ is

    $\nabla_r \mathrm{MSE} = -\mathrm{E}\left[ \left( J(x) - \hat{J}(x) \right) \phi \right] \approx -\mathrm{E}[d \phi],$    (15)

where in the approximation we use the bootstrapping method (see [16] for a
discussion) in order to get the well known TD algorithm (i.e., substituting
$J \approx T\hat{J}$). On top of the above TD algorithm, we take a derivative with respect
to $s_i$, $i = 1, \ldots, K_s$, yielding

    $\frac{\partial \mathrm{MSE}}{\partial s_i} = -\mathrm{E}\left[ \left( J(x) - \hat{J}(x) \right) \frac{\partial \hat{J}(x)}{\partial s_i} \right] \approx -\mathrm{E}\left[ d\, \frac{\partial \phi^\top}{\partial s_i} r \right],$    (16)

where again we use the bootstrapping method. Note that this equation gives the
non-linear TD procedure for the basis parameters. We use SA in order to solve
the stochastic equations (15) and (16), which together with Theorem 5 is the
basis for the following algorithm. For technical reasons, we add a requirement
that the iterates for $\theta$ and $s$ are bounded, which in practice is not constraining
(see in] for discussion on constrained SA).

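Algorithm 7. Adaptive basis TD (ABTD).

    $\eta_{n+1} = \eta_n + \alpha^{(3)}_n \left( g_n - \eta_n \right),$    (17)

    $r_{n+1} = r_n + \alpha^{(3)}_n d_n \phi_n,$    (18)

    $\theta_{n+1} = \Pi_\theta \left( \theta_n + \alpha^{(2)}_n \psi_n d_n \right),$    (19)

    $s_{i,n+1} = \Pi_s \left( s_{i,n} + \alpha^{(1)}_n d_n \frac{\partial \phi^\top_n}{\partial s_i} r_n \right), \quad i = 1, \ldots, K_s,$    (20)

where $\Pi_\theta$ and $\Pi_s$ are projection operators onto non-empty open constraint
sets $H_\theta$ and $H_s$, applied whenever $\theta_n \notin H_\theta$ and $s_n \notin H_s$, respectively, and the
step size series $\{\alpha^{(i)}_n\}$ for $i = 1, 2, 3$ satisfy Assumption 3.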

We note that this algorithm is an AC algorithm with three time scales: the usual
two time scales (choosing $\alpha^{(1)}_n \equiv 0$ yields Algorithm 1 of [5]), and a
third iterate, added for the basis adaptation, which is the slowest.
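
The updates (17)-(20) can be sketched on a small finite MDP as follows. This is a minimal illustration under stated assumptions, not the authors' code: the basis is the assumed one-parameter cosine basis $\phi_k(x, s) = \cos(s + \text{phase}_{x,k})$, the policy is a tabular softmax, the projections are implemented as clipping to a box, and the step-size exponents and the factor 0.1 are arbitrary choices that respect Assumption 3.

```python
import math
import random


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def abtd(P, g, phases, steps=5000, seed=0, bound=10.0):
    """Run a sketch of ABTD; P[u][x][y] are transitions, g[x] rewards."""
    rng = random.Random(seed)
    U, N, K = len(P), len(g), len(phases[0])
    eta, s = 0.0, 0.0
    r = [0.0] * K
    theta = [[0.0] * U for _ in range(N)]

    def clip(v):  # projection onto the box [-bound, bound]
        return max(-bound, min(bound, v))

    x = 0
    for n in range(steps):
        # Critic is fastest, actor slower, basis adaptation slowest.
        a_critic = 0.1 * (n + 1) ** -0.6
        a_actor = 0.1 * (n + 1) ** -0.8
        a_basis = 0.1 * (n + 1) ** -1.0

        # Sample an action from the softmax policy, then the next state.
        w = [math.exp(t) for t in theta[x]]
        mu = [v / sum(w) for v in w]
        u = rng.choices(range(U), weights=mu)[0]
        y = rng.choices(range(N), weights=P[u][x])[0]

        phi_x = [math.cos(s + p) for p in phases[x]]
        phi_y = [math.cos(s + p) for p in phases[y]]
        dphi_x = [-math.sin(s + p) for p in phases[x]]
        d = g[x] - eta + dot(phi_y, r) - dot(phi_x, r)   # TD signal d_n
        basis_dir = d * dot(dphi_x, r)                   # d_n (dphi_n/ds)^T r_n

        eta += a_critic * (g[x] - eta)                             # (17)
        r = [ri + a_critic * d * fi for ri, fi in zip(r, phi_x)]   # (18)
        for b in range(U):                                         # (19)
            psi = (1.0 if b == u else 0.0) - mu[b]
            theta[x][b] = clip(theta[x][b] + a_actor * psi * d)
        s = clip(s + a_basis * basis_dir)                          # (20)
        x = y
    return eta, r, theta, s


# Made-up MDP with 3 states, 2 actions, and 2 basis functions.
P = [[[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]],
     [[0.2, 0.3, 0.5], [0.5, 0.25, 0.25], [0.1, 0.1, 0.8]]]
g = [0.0, 1.0, 3.0]
phases = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
print(abtd(P, g, phases))
```

Setting the basis step size to zero in this sketch recovers a plain two time scale actor-critic with a fixed basis, which is a convenient baseline for comparison.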



3.3 Minimum Square Bellman Error 

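The Minimum Square Bellman Error (MSBE) is defined as

    $\mathrm{MSBE} = \frac{1}{2} \mathrm{E}\left[ \left( T\hat{J}(x) - \hat{J}(x) \right)^2 \right].$

The gradient with respect to $r$ is

    $\nabla_r \mathrm{MSBE} = \mathrm{E}\left[ d \left( \phi' - \phi \right) \right],$

and the derivative with respect to $s_i$, $i = 1, \ldots, K_s$, is

    $\frac{\partial \mathrm{MSBE}}{\partial s_i} = \mathrm{E}\left[ d \left( \frac{\partial \phi'^\top}{\partial s_i} r - \frac{\partial \phi^\top}{\partial s_i} r \right) \right].$

Based on this we have the following SA algorithm, which is similar to Algorithm
7 except for the iterates for $r_n$ and $s_n$.

Algorithm 8. Adaptive Basis for Bellman Error (ABBE). Consider the
iterates for $\eta$ and $\theta$ in Algorithm 7. The iterates for $r$ and $s_i$ are

    $r_{n+1} = r_n - \alpha^{(3)}_n d_n \left( \phi'_n - \phi_n \right),$

    $s_{i,n+1} = \Pi_s \left( s_{i,n} - \alpha^{(1)}_n d_n \left( \frac{\partial \phi'^\top_n}{\partial s_i} r_n - \frac{\partial \phi^\top_n}{\partial s_i} r_n \right) \right), \quad i = 1, \ldots, K_s.$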
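
A single ABBE update of $r$ and a scalar $s$ can be sketched as below. This is an illustrative assumption-laden example rather than the authors' implementation: the features and their derivatives are passed in as plain lists, the projection is a clip to a box, and the numbers reuse the assumed cosine basis from the earlier illustrations. For a fixed sampled transition, a small step along these directions reduces the squared TD error of that transition.

```python
import math


def abbe_step(r, s, d, phi_x, phi_y, dphi_x, dphi_y, a_r, a_s, bound=10.0):
    """One update of r and scalar s: r moves along -d (phi' - phi), and s
    moves along -d (dphi'/ds - dphi/ds)^T r, then is clipped to the box."""
    diff = [fy - fx for fx, fy in zip(phi_x, phi_y)]
    ddiff = [gy - gx for gx, gy in zip(dphi_x, dphi_y)]
    r_new = [ri - a_r * d * di for ri, di in zip(r, diff)]
    s_new = s - a_s * d * sum(gi * ri for gi, ri in zip(ddiff, r))
    return r_new, max(-bound, min(bound, s_new))


def td_error(c, r, s, ph_x, ph_y):
    """d = c + (phi(y, s) - phi(x, s))^T r, with c = g(x) - eta held fixed."""
    return c + sum((math.cos(s + py) - math.cos(s + px)) * ri
                   for px, py, ri in zip(ph_x, ph_y, r))


# Made-up transition: phases of the assumed cosine basis at x and at y.
ph_x, ph_y = [0.0, 1.0], [0.5, 2.0]
r, s, c = [1.0, -0.5], 0.3, 0.7
d_old = td_error(c, r, s, ph_x, ph_y)
r_new, s_new = abbe_step(
    r, s, d_old,
    [math.cos(s + p) for p in ph_x], [math.cos(s + p) for p in ph_y],
    [-math.sin(s + p) for p in ph_x], [-math.sin(s + p) for p in ph_y],
    a_r=0.01, a_s=0.01)
d_new = td_error(c, r_new, s_new, ph_x, ph_y)
print(d_old, d_new)  # the magnitude of d should shrink after the step
```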

3.4 Minimum Square Projected Bellman Error 

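The Minimum Square Projected Bellman Error (MSPBE) is defined as

    $\mathrm{MSPBE} = \mathrm{E}\left[ \left( \Pi T\hat{J}(x) - \hat{J}(x) \right)^2 \right] = \mathrm{E}[d\phi]^\top \left( \mathrm{E}[\phi\phi^\top] \right)^{-1} \mathrm{E}[d\phi],$

where the projection operator is defined in (6) and where the second equality
was proved by Sutton et al. [17], Section 4. We note that the projection
operator is independent of $r$ but depends on the basis parameter $s$. Define
$w = \left( \mathrm{E}[\phi\phi^\top] \right)^{-1} \mathrm{E}[d\phi]$. Thus, $w$ is the solution to the equation
$\left( \mathrm{E}[\phi\phi^\top] \right) w = \mathrm{E}[d\phi]$, which yields $\mathrm{MSPBE} = w^\top \mathrm{E}[d\phi]$. Define, similarly to
g] Section 6.3.3, $Ar + b = \mathrm{E}[d\phi]$, where $A = \mathrm{E}[\phi(\phi' - \phi)^\top]$ and
$b = \mathrm{E}[\phi(g - \eta)]$. Define $A^{(i)}$ to be the $i$-th column of $A$. For later use, we
give here the gradient of $w$ with respect to $r$ and $s$ in implicit form:

    $\mathrm{E}[\phi\phi^\top] \frac{\partial w}{\partial r_i} = A^{(i)},$

    $\mathrm{E}\left[ \frac{\partial \phi}{\partial s_i} \phi^\top + \phi \frac{\partial \phi^\top}{\partial s_i} \right] w + \mathrm{E}[\phi\phi^\top] \frac{\partial w}{\partial s_i} = \frac{\partial A}{\partial s_i} r + \frac{\partial b}{\partial s_i}.$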



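Denote by $A_n$, $A^s_{i,n}$, $b^s_{i,n}$, $w_n$, $w^r_{i,n}$, and $w^s_{i,n}$ the estimators at time $n$ of
$A$, $\partial A/\partial s_i$, $\partial b/\partial s_i$, $w$, $\partial w/\partial r_i$, and $\partial w/\partial s_i$, respectively. Define
$A^{(i)}_n$ to be the $i$-th column of $A_n$. Thus, the SA iterations for these estimators are

    $A_{n+1} = A_n + \alpha^{(4)}_n \left( \phi_n \left( \phi'_n - \phi_n \right)^\top - A_n \right),$

    $w_{n+1} = w_n + \alpha^{(4)}_n \left( \phi_n d_n - \phi_n \phi^\top_n w_n \right),$

where $\{\alpha^{(4)}_n\}$ satisfies Assumption 3. Next, we compute the gradient of the
objective function MSPBE with respect to $r$ and $s$ and suggest a gradient descent
algorithm to find the optimal value. Thus,

    $\frac{\partial \mathrm{MSPBE}}{\partial r_i} = \mathrm{E}[d\phi]^\top \frac{\partial w}{\partial r_i} + w^\top \frac{\partial \mathrm{E}[d\phi]}{\partial r_i},$

    $\frac{\partial \mathrm{MSPBE}}{\partial s_i} = \frac{\partial w^\top}{\partial s_i} \mathrm{E}[d\phi] + w^\top \frac{\partial \mathrm{E}[d\phi]}{\partial s_i}.$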

The following algorithm gives the SA iterates for $r$ and $s$, where the iterates
for $\eta$ and $\theta$ are the same as in Algorithms 7 and 8 and are therefore omitted.
This algorithm has four time scales. The fastest time scale, related to the step
sizes $\{\alpha^{(4)}_n\}$, is the estimators' time scale, i.e., the estimators for $A$, $\partial A/\partial s_i$,
$\partial b/\partial s_i$, $w$, $\partial w/\partial r_i$, and $\partial w/\partial s_i$. The linear parameters of the critic, i.e.,
$r$ and $\eta$, related to the step sizes $\{\alpha^{(3)}_n\}$, are estimated on the second fastest
time scale. The actor parameter $\theta$, related to the step sizes $\{\alpha^{(2)}_n\}$, is estimated
on the second slowest time scale. Finally, the critic non-linear parameter $s$,
related to the step sizes $\{\alpha^{(1)}_n\}$, is estimated on the slowest time scale. We note
that a version where the two fastest time scales operate on a joint single fastest
time scale is possible, but results in additional technical difficulties in the
convergence proof.

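Algorithm 9. Adaptive Basis for PBE (ABPBE). Consider the iterates for $\eta$
and $\theta$ in Algorithm 7. The iterates for $r$ and $s$ are

    $r_{i,n+1} = r_{i,n} - \alpha^{(3)}_n \left( d_n \phi^\top_n w^r_{i,n} + w^\top_n A^{(i)}_n \right), \quad i = 1, \ldots, K_r,$

    $s_{i,n+1} = s_{i,n} - \alpha^{(1)}_n \left( d_n \phi^\top_n w^s_{i,n} + \left( A^s_{i,n} r_n + b^s_{i,n} \right)^\top w_n \right), \quad i = 1, \ldots, K_s.$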



4 Analysis 

In this section we prove the convergence of Algorithms 7 and 8 of the previous
section. We omit the convergence proof of Algorithm 9, which is similar to the
convergence proof of Algorithm 8.

4.1 Convergence of ABTD 

We begin by stating a theorem regarding the ABTD convergence. Due to space
limitations, we give only a proof sketch based on the convergence proof of
Theorem 2 of Bhatnagar et al. [5]. The self-contained proof under more general
conditions is left to the long version of this work.

Theorem 10. Consider Algorithm 7 and suppose Assumptions 1, 3, and 6 hold.
Then, the iterates (17)-(20) of Algorithm 7 converge w.p. 1 to a point that locally
maximizes $\eta$ and solves the equation $\mathrm{E}[d\, \nabla_s \phi^\top r] = 0$.

Proof. (Sketch) There are three time scales in (17)-(20); therefore, we wish to
use Theorem 5, i.e., we need to prove that the requirements of Assumption 4 are
valid w.r.t. all iterations, i.e., $\eta_n$, $r_n$, $\theta_n$, and $s_n$.

Requirements 1-4 w.r.t. iterates $\eta_n$, $r_n$, $\theta_n$. Bhatnagar et al. proved in [5]
that (17)-(19) converge for a specific $s$. Assumption 6 implies that
requirements 1-4 of Assumption 4 are valid regarding the iterates of $\eta_n$, $r_n$, and $\theta_n$
uniformly for all $s \in \mathbb{R}^{K_s}$. Therefore, it is sufficient to prove that on top of
(17)-(19) also iterate (20) converges, i.e., that requirements 1-4 of Assumption 4
are valid w.r.t. $s_n$.

Requirement 1 w.r.t. iterate $s_n$. Define the $\sigma$-algebra
$\mathcal{F}_n = \sigma(\eta_k, r_k, \theta_k, s_k : k \le n)$, and define
$F^{(\eta)}_n \triangleq \mathrm{E}[g_n - \eta_n | \mathcal{F}_n]$, $F^{(r)}_n \triangleq \mathrm{E}[d_n \phi_n | \mathcal{F}_n]$,
$F^{(\theta)}_n \triangleq \Pi_\theta \mathrm{E}[\psi_n d_n | \mathcal{F}_n]$,
$F^{(s_i)}_n \triangleq \Pi_s \mathrm{E}\left[ d_n \frac{\partial \phi^\top_n}{\partial s_i} r_n \,\middle|\, \mathcal{F}_n \right]$, and
$M^{(s_i)}_{n+1} \triangleq \Pi_s \left[ \left( d_n \frac{\partial \phi^\top_n}{\partial s_i} r_n \right) - F^{(s_i)}_n \right]$. Thus, (20)
can be expressed as

    $s_{i,n+1} = s_{i,n} + \alpha^{(1)}_n \left( F^{(s_i)}_n + M^{(s_i)}_{n+1} \right).$    (21)

Trivially, using Assumption 6, $F^{(\eta)}_n$, $F^{(r)}_n$, and $F^{(\theta)}_n$ are Lipschitz with respect
to $s$ (the coefficients of the latter two being $L_\phi$). Also, $F^{(s_i)}_n$ is Lipschitz
with respect to $\eta$, $r$, and $\theta$ with coefficients 1, $B_\phi$, and 1, respectively. Thus,
requirement 1 of Assumption 4 is valid.

Requirements 2 and 3 w.r.t. iterate $s_n$. By construction, the iterate $s_n$ is
bounded. Requirement 3 of Assumption 4 is valid using the boundedness of the
martingale difference noise $M^{(s)}_n$, which implies, using the martingale convergence
theorem [3], that the martingale $\sum_n \alpha^{(1)}_n M^{(s)}_{n+1}$ converges.

Requirement 4 w.r.t. iterate $s_n$. Using the result of Bhatnagar et al. [5], the
fast time scales converge w.r.t. the slow time scale. Thus, Requirement 4 is valid
based on the fact that the iterates (17)-(19) converge. □



4.2 Convergence of Adaptive Basis for Bellman Error 

We begin by stating the theorem and then we prove it. 

Theorem 11. Consider Algorithm 8 and suppose that Assumptions 1, 3, and 6
hold. Then, Algorithm 8 converges w.p. 1 to a point that locally maximizes $\eta$ and
locally minimizes $\mathrm{E}[d^2]$.

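Proof. (Sketch) To use Theorem 5 we need to check that Assumption 4 is valid.
Define the $\sigma$-algebra $\mathcal{F}_n = \sigma(\eta_k, r_k, \theta_k, s_k : k \le n)$, and define

    $F^{(\eta)}_n \triangleq \mathrm{E}[g_n - \eta_n | \mathcal{F}_n], \quad M^{(\eta)}_{n+1} \triangleq (g_n - \eta_n) - F^{(\eta)}_n,$

    $F^{(r)}_n \triangleq -\mathrm{E}[d_n (\phi'_n - \phi_n) | \mathcal{F}_n], \quad M^{(r)}_{n+1} \triangleq -\left( d_n (\phi'_n - \phi_n) \right) - F^{(r)}_n,$

    $F^{(\theta)}_n \triangleq \mathrm{E}[\psi_n d_n | \mathcal{F}_n], \quad M^{(\theta)}_{n+1} \triangleq (\psi_n d_n) - F^{(\theta)}_n,$

    $F^{(s_i)}_n \triangleq -\mathrm{E}\left[ d_n \left( \frac{\partial \phi'^\top_n}{\partial s_i} r_n - \frac{\partial \phi^\top_n}{\partial s_i} r_n \right) \,\middle|\, \mathcal{F}_n \right],$

    $M^{(s_i)}_{n+1} \triangleq -\left( d_n \left( \frac{\partial \phi'^\top_n}{\partial s_i} r_n - \frac{\partial \phi^\top_n}{\partial s_i} r_n \right) \right) - F^{(s_i)}_n.$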

On the fast time scale (which is related to $\alpha^{(3)}_n$), as in Theorem 10, $\eta_n$
converges to $\mathrm{E}[g(x)]$. On the same time scale we need to show that the iterate for
$r_n$ converges. Using the above definitions, we can write the iteration for $r_n$ as

    $r_{n+1} = r_n + \alpha^{(3)}_n \left( F^{(r)}_n + M^{(r)}_{n+1} \right).$    (22)



We use Theorem 2.2 of Borkar and Meyn [7] to achieve this. Briefly, this theorem
states that given an iteration such as (22), this iteration is bounded w.p. 1 if

(A1) The process $F^{(r)}_n$ is Lipschitz, the function
     $F_\infty(r) = \lim_{\sigma \to \infty} F^{(r)}(\sigma r)/\sigma$ is Lipschitz, and the ODE
     $\dot{r} = F_\infty(r)$ is asymptotically stable at the origin.
(A2) The sequence $M^{(r)}_{n+1}$ is a martingale difference noise and for some $C_0$

     $\mathrm{E}\left[ \left( M^{(r)}_{n+1} \right)^2 \,\middle|\, \mathcal{F}_n \right] \le C_0 \left( 1 + \|r_n\|^2 \right).$



Trivially, the function $F^{(r)}_n$ is Lipschitz continuous, and we have

    $\lim_{\sigma \to \infty} F^{(r)}(\sigma r)/\sigma = -\mathrm{E}\left[ (\phi' - \phi)(\phi' - \phi)^\top \right] r.$

Thus, it is easy to show, using Assumption 6, that the ODE $\dot{r} = F_\infty(r)$ has a
unique globally asymptotically stable point at the origin and (A1) is valid. For
(A2) we have

    $\mathrm{E}\left[ \left\| M^{(r)}_{n+1} \right\|^2 \,\middle|\, \mathcal{F}_n \right] \le \mathrm{E}\left[ \left\| d_n (\phi'_n - \phi_n) \right\|^2 \,\middle|\, \mathcal{F}_n \right] \le 2 \left( B_g + B_\eta + 4 B_\phi^2 \|r_n\|^2 \right) \le K'' \left( 1 + \|r_n\|^2 \right),$

where the first inequality results from the inequality $\mathrm{E}[(x - \mathrm{E}[x])^2] \le \mathrm{E}[x^2]$,
and the second inequality results from the uniform boundedness of the involved
variables. We note that the related ODE for this iteration is given by $\dot{r} = F^{(r)}$,
and the related Lyapunov function is given by $\mathrm{E}[d^2]$. Next, we need to show that
under the convergence of the fast time scales for $\eta_n$ and $r_n$, the slower iterate
for $\theta$ converges. The proof of this is identical to that of Theorem 2 of [5] and is
therefore omitted. We are left with proving that if the fast time scales converge,
i.e., the iterates $\eta_n$, $r_n$, and $\theta_n$, then the iterate $s_n$ converges as well. The proof
follows similar lines to the proof for $s_n$ in the proof of Theorem 10, whereas
here the iterate $s_n$ converges to the stable point of the ODE
$\dot{s} = -\nabla_s \mathrm{E}[d(x, y)^2]$. □

5 Simulations 

In this section we report empirical results applying the algorithms on two types 
of problems: GARNET problems [T] and the mountain car problem. 

5.1 Garnet problems 

The GARNET problems [1,5] are a class of randomly constructed finite MDPs
serving as a test-bench for RL algorithms. A GARNET problem is characterized
by four parameters and is denoted by GARNET($X$, $U$, $B$, $\sigma$). The parameter $X$ is
the number of states, $U$ is the number of actions, $B$ is the branching factor, and
$\sigma^2$ is the variance of each transition reward. When constructing such a problem,
we generate for each state a reward, distributed according to $\mathcal{N}(0, 1)$. For each
state-action the reward is distributed according to $\mathcal{N}(g(x), \sigma^2)$. The transition
matrix for each action is composed of $B$ non-zero terms. We consider the same
GARNET problems as those simulated by [S]. For the critic's feature vector, we use
the basis functions (f>{x, s) = cos (|s -I- gx,d), where x ~ 1, . . . ,N, 1 < d < Kr, 
s £ R^, and gx,d are i.i.d. uniform random phases. Note that only one parameter 
in this simulation controls the basis functions. The actor's feature vectors are of
size $K_a \times |U|$, and are constructed as

^{x,u)^{0^^,^{x,sit = 0)), OT^ • 

The policy function is fi{u\x,e) = e*^«(^'"VEu'e(7 ^^^^^'''"'''- Bhatnagar et al. 
[S] reported simulation results for two GARNET problems: GARNET(30, 4, 2, 0.1)
and GARNET(100, 10, 3, 0.1). We based our simulations on these results, where
the time steps are identical to those of [S]. The GARNET(30, 4, 2, 0.1) problem
(Fig. 1, left pane) was simulated for $K_r = 4$ (two lower graphs) and $K_r = 12$
(two upper graphs), where each graph is an average of 100 repeats. The
GARNET(100, 10, 3, 0.1) problem (Fig. 1, right pane) was simulated for $K_r = 4$ (two
lower graphs) and $K_r = 12$ (two upper graphs), where each graph is an average
of 100 repeats. We can see that in such problems there is an evident advantage
to an adaptive basis, which can achieve a better fit to the problem, and
thus even for low dimensional problems the adaptation may be crucial.
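
A GARNET-style problem of the kind described above can be generated with a few lines of code. The sketch below is an assumption-based illustration, not the generator used in the simulations: the text does not specify how the $B$ non-zero transition probabilities are drawn, so here they are uniform random weights normalized to sum to one, and `make_garnet` and `sample_reward` are hypothetical names.

```python
import random


def make_garnet(n_states, n_actions, branching, seed=0):
    """Return (P, g): P[u][x][y] transition probabilities, g[x] mean rewards."""
    rng = random.Random(seed)
    g = [rng.gauss(0.0, 1.0) for _ in range(n_states)]  # mean rewards ~ N(0, 1)
    P = []
    for _ in range(n_actions):
        P_u = []
        for _ in range(n_states):
            successors = rng.sample(range(n_states), branching)
            # Assumption: random positive weights, normalized to a distribution.
            weights = [rng.random() + 1e-3 for _ in successors]
            total = sum(weights)
            row = [0.0] * n_states
            for y, w in zip(successors, weights):
                row[y] = w / total
            P_u.append(row)
        P.append(P_u)
    return P, g


def sample_reward(g, x, sigma, rng):
    """Noisy reward for visiting state x, distributed as N(g[x], sigma^2)."""
    return rng.gauss(g[x], sigma)


P, g = make_garnet(n_states=30, n_actions=4, branching=2, seed=0)
print(sum(1 for p in P[0][0] if p > 0.0))  # number of successors of state 0
```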

5.2 The Mountain Car 

The mountain car task (see [T5| or [16] for details) is a physical problem where
a car is positioned randomly between two mountains (see Fig. 2, left pane) and
needs to climb the right mountain, but the engine of the car does not support
such a straight climb. Thus, the car needs to accumulate sufficient gravitational
energy, by applying back and forth actions, in order to succeed.

Fig. 1. Results for GARNET(30, 4, 2, 0.1) (left pane) and GARNET(100, 10, 3, 0.1) (right
pane), where circled graphs are for adaptive bases. In each pane the lower two graphs
are for $K_r = 4$ and the upper graphs are for $K_r = 12$. See text for details.

We applied the adaptive basis TD algorithm to this problem. We chose the
critic basis functions to be radial basis functions (RBF) (see [5]), where the value
function is represented by
$\sum_{i=1}^{K_r} r_i \exp\left\{ -\left( p - s^{(p)}_i \right)^2 / s^2_{p,i} - \left( v - s^{(v)}_i \right)^2 / s^2_{v,i} \right\}$. The
centers of the RBFs are parameterized by $\left( s^{(p)}_i, s^{(v)}_i \right)_{i=1}^{K_r}$ while the variance is
represented by $\left( s^2_{p,i}, s^2_{v,i} \right)_{i=1}^{K_r}$. In the right pane of Fig. 2 we present simulation
results for 4 cases: SARSA (blue dash), which is based on the implementation
of [TS]; AC (red dash-dot) with 64 basis functions uniformly distributed on the
parameter space; ABTD with 64 basis functions (magenta dotted), where both
the location and the variance of the basis functions can adapt; and ABAC with 16
basis functions (black solid) with the same adaptation. We see that the adaptive
basis gives a significant advantage in performance. Moreover, we see that even
with a small number of parameters, the performance is not affected. In the middle
pane, the dynamics of a realization of the basis functions is presented, where
the dots and circles are the initial positions and final positions of the basis
functions, respectively. The circle sizes are proportional to the basis functions'
standard deviations, i.e., $\left( s_{p,i}, s_{v,i} \right)_{i=1}^{K_r}$.
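
A Gaussian RBF value function over position $p$ and velocity $v$, with adaptable centers and widths, can be sketched as follows. This is an illustrative reading of the representation described above, not the code used for the reported simulations; the argument layout (pairs of centers and pairs of variances) and all function names are assumptions. The derivative with respect to a center coordinate is included because it is the kind of quantity the basis adaptation step needs.

```python
import math


def rbf_features(p, v, centers, variances):
    """exp(-(p - cp)^2 / vp - (v - cv)^2 / vv) for every basis function."""
    return [math.exp(-(p - cp) ** 2 / vp - (v - cv) ** 2 / vv)
            for (cp, cv), (vp, vv) in zip(centers, variances)]


def rbf_value(p, v, r, centers, variances):
    """Value estimate: weighted sum of the RBF features."""
    return sum(ri * fi for ri, fi in zip(r, rbf_features(p, v, centers, variances)))


def dfeature_dcenter_p(p, v, center, variance):
    """Derivative of one feature with respect to its position center cp."""
    (cp, cv), (vp, vv) = center, variance
    f = math.exp(-(p - cp) ** 2 / vp - (v - cv) ** 2 / vv)
    return f * 2.0 * (p - cp) / vp


# Made-up configuration with two basis functions.
centers = [(-0.5, 0.0), (0.3, 0.02)]
variances = [(0.25, 0.01), (0.25, 0.01)]
r = [1.5, -0.7]
print(rbf_value(-0.4, 0.01, r, centers, variances))
```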



5.3 The Performance of Multiple Time Scales vs. Single Time Scale 

In this section we discuss the differences in performance between multiple time
scales (MTS) and single time scale (STS) algorithms. Contrary to what is
sometimes assumed, neither MTS algorithms nor STS algorithms have an inherent
advantage in terms of convergence. The difference in performance comes from the
fact that the two methods perform the gradient algorithm differently; thus, they
may result in different trajectories. In Fig. 3 we can see a case
on a GARNET(30, 5, 5, 0.1) where the MTS ABTD algorithm (upper red diamond
graph) has an advantage over STS ABTD algorithms or the MTS static basis AC
algorithm as in [5] (rest of the graphs). We note that this is not always the case
and it depends on the problem parameters or the initial conditions.

Fig. 2. (left pane) Illustration of the mountain car task. (middle pane) Realization
of ABTD with 16 basis functions, where the red dots are the basis functions' initial
positions and the circles are their final positions. The radii are proportional to the
variance. The rectangle represents the bounded parameter set of the car. (right pane)
Simulation results for the mountain car problem with solutions of SARSA (blue dash),
AC (red dash-dot), AB-AC with 64 basis functions (magenta dotted), and AB-AC with 16
basis functions (black solid).

Fig. 3. Results for GARNET(30, 5, 5, 0.1) for $K_r = 8$. The upper diamond red graph is
the MTS ABTD algorithm, the circled green graph is STS ABTD acting on the slow time
scale, the blue crossed line is the MTS static basis AC algorithm as in [5], and the
black starred line is STS ABTD acting on the fast time scale. Each graph is an average
of 100 simulation runs.



6 Discussion 



We introduced three new AC based algorithms where the critic's basis is
adaptive. Convergence proofs, in the average reward case, were provided. We note
that the algorithms can be easily transformed to the discounted reward case. When
considering other target functions, more AC algorithms with adaptive bases can
be devised, e.g., considering the objective function $\|\mathrm{E}[d\phi]\|^2$ yields A^TD and
GTD(0) algorithms [H]. Also, mixing the different algorithms introduced here
can yield new algorithms with some desired properties. For example, we can
devise an algorithm where the linear part is updated similarly to (15) and the
non-linear part is updated similarly to (21). Convergence of such algorithms will
follow the same lines of proof as introduced here.



The advantage of adaptive bases is evident: they relieve the domain expert
from the task of carefully designing the basis. Instead, the expert may choose a
flexible basis, and use algorithms such as those introduced here to adapt the basis
to the problem at hand. From a methodological point of view, the method we
introduced in this paper demonstrates how to easily transform an existing RL
algorithm into an adaptive basis algorithm. The analysis of the original problem is
used to show convergence of the faster time scales, and the slow time scale is used
for modifying the basis, analogously to the "code reuse" concept in software
engineering.